INN Hotels Project
Context
A significant number of hotel bookings are called off due to cancellations or no-shows, typically because of changed plans or scheduling conflicts. Cancelling is often free of charge, or available at low cost, which is convenient for guests but a less desirable, potentially revenue-diminishing factor for hotels. Losses are particularly high on last-minute cancellations.
The new technologies involving online booking channels have dramatically changed customers' booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts:
- Loss of resources (revenue) when the hotel cannot resell the room.
- Additional distribution-channel costs, from higher commissions or paid publicity to help resell these rooms.
- Last-minute price cuts to resell the room, reducing the profit margin.
- Human resources to make arrangements for the guests.
Objective
The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group, a chain of hotels in Portugal, is facing problems with a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings will be canceled, and help formulate profitable cancellation and refund policies.
Data Description
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
- Booking_ID: unique identifier of each booking
- no_of_adults: Number of adults
- no_of_children: Number of Children
- no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
- no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
- type_of_meal_plan: Type of meal plan booked by the customer:
- Not Selected – No meal plan selected
- Meal Plan 1 – Breakfast
- Meal Plan 2 – Half board (breakfast and one other meal)
- Meal Plan 3 – Full board (breakfast, lunch, and dinner)
- required_car_parking_space: Does the customer require a car parking space? (0 = No, 1 = Yes)
- room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
- lead_time: Number of days between the date of booking and the arrival date
- arrival_year: Year of arrival date
- arrival_month: Month of arrival date
- arrival_date: Day of the month of the arrival date
- market_segment_type: Market segment designation.
- repeated_guest: Is the customer a repeated guest? (0 = No, 1 = Yes)
- no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
- no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
- avg_price_per_room: Average price per day of the reservation (in euros); room prices are dynamic.
- no_of_special_requests: Total number of special requests made by the customer (e.g., high floor, view from the room)
- booking_status: Flag indicating if the booking was canceled or not.
Importing necessary libraries
# Installing the libraries with the specified version.
#!pip install pandas==1.5.3 numpy==1.25.2 matplotlib==3.7.1 seaborn==0.13.2 scikit-learn==1.2.2 statsmodels==0.14.1 -q
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Setting the precision of floating-point numbers to 5 decimal places
pd.set_option("display.float_format", lambda x: "%.5f" % x)
# For feature scaling
from sklearn.preprocessing import StandardScaler
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn import metrics
# To tune different models
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
Import Dataset
# Import data files
from google.colab import drive
drive.mount('/content/drive')
df_path = "/content/drive/MyDrive/DS Course/INNHotelsGroup.csv"
df = pd.read_csv(df_path)
# Making a copy to ease troubleshooting
hotel = df.copy()
Mounted at /content/drive
Data Overview
- Observations
- Sanity checks
def run_diagnostics(df, name="DataFrame"):
    print(f"\nDiagnostic check on {name}")
    # 1. Columns that are entirely NaN
    na_cols = df.columns[df.isna().all()]
    print(f"All-NaN columns: {na_cols.tolist()}")
    # 2. Columns with only one unique value
    constant_cols = [col for col in df.columns if df[col].nunique(dropna=True) == 1]
    print(f"Constant columns (no variance): {constant_cols}")
    # 3. Columns with a high NaN ratio (> 20%)
    high_na_cols = df.columns[df.isna().mean() > 0.2]
    print(f"Columns with high NaN ratio (> 20%): {high_na_cols.tolist()}")
    # 4. Summary of NaN counts (if any)
    print("\nNaN counts by column:")
    print(df.isna().sum()[df.isna().sum() > 0])
    # 5. Preview data types
    print("\nData type summary:")
    print(df.dtypes.value_counts())
    print("\nDiagnostics complete.")
# Examine the dataframe to notice any data issues to correct
print(hotel.head())
print(hotel.tail())
print(hotel.shape)
print(hotel.info())
print(hotel.describe())
Booking_ID no_of_adults no_of_children no_of_weekend_nights \
0 INN00001 2 0 1
1 INN00002 2 0 2
2 INN00003 1 0 2
3 INN00004 2 0 0
4 INN00005 2 0 1
no_of_week_nights type_of_meal_plan required_car_parking_space \
0 2 Meal Plan 1 0
1 3 Not Selected 0
2 1 Meal Plan 1 0
3 2 Meal Plan 1 0
4 1 Not Selected 0
room_type_reserved lead_time arrival_year arrival_month arrival_date \
0 Room_Type 1 224 2017 10 2
1 Room_Type 1 5 2018 11 6
2 Room_Type 1 1 2018 2 28
3 Room_Type 1 211 2018 5 20
4 Room_Type 1 48 2018 4 11
market_segment_type repeated_guest no_of_previous_cancellations \
0 Offline 0 0
1 Online 0 0
2 Online 0 0
3 Online 0 0
4 Online 0 0
no_of_previous_bookings_not_canceled avg_price_per_room \
0 0 65.00000
1 0 106.68000
2 0 60.00000
3 0 100.00000
4 0 94.50000
no_of_special_requests booking_status
0 0 Not_Canceled
1 1 Not_Canceled
2 0 Canceled
3 0 Canceled
4 0 Canceled
Booking_ID no_of_adults no_of_children no_of_weekend_nights \
36270 INN36271 3 0 2
36271 INN36272 2 0 1
36272 INN36273 2 0 2
36273 INN36274 2 0 0
36274 INN36275 2 0 1
no_of_week_nights type_of_meal_plan required_car_parking_space \
36270 6 Meal Plan 1 0
36271 3 Meal Plan 1 0
36272 6 Meal Plan 1 0
36273 3 Not Selected 0
36274 2 Meal Plan 1 0
room_type_reserved lead_time arrival_year arrival_month \
36270 Room_Type 4 85 2018 8
36271 Room_Type 1 228 2018 10
36272 Room_Type 1 148 2018 7
36273 Room_Type 1 63 2018 4
36274 Room_Type 1 207 2018 12
arrival_date market_segment_type repeated_guest \
36270 3 Online 0
36271 17 Online 0
36272 1 Online 0
36273 21 Online 0
36274 30 Offline 0
no_of_previous_cancellations no_of_previous_bookings_not_canceled \
36270 0 0
36271 0 0
36272 0 0
36273 0 0
36274 0 0
avg_price_per_room no_of_special_requests booking_status
36270 167.80000 1 Not_Canceled
36271 90.95000 2 Canceled
36272 98.39000 2 Not_Canceled
36273 94.50000 0 Canceled
36274 161.67000 0 Not_Canceled
(36275, 19)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Booking_ID 36275 non-null object
1 no_of_adults 36275 non-null int64
2 no_of_children 36275 non-null int64
3 no_of_weekend_nights 36275 non-null int64
4 no_of_week_nights 36275 non-null int64
5 type_of_meal_plan 36275 non-null object
6 required_car_parking_space 36275 non-null int64
7 room_type_reserved 36275 non-null object
8 lead_time 36275 non-null int64
9 arrival_year 36275 non-null int64
10 arrival_month 36275 non-null int64
11 arrival_date 36275 non-null int64
12 market_segment_type 36275 non-null object
13 repeated_guest 36275 non-null int64
14 no_of_previous_cancellations 36275 non-null int64
15 no_of_previous_bookings_not_canceled 36275 non-null int64
16 avg_price_per_room 36275 non-null float64
17 no_of_special_requests 36275 non-null int64
18 booking_status 36275 non-null object
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
None
no_of_adults no_of_children no_of_weekend_nights no_of_week_nights \
count 36275.00000 36275.00000 36275.00000 36275.00000
mean 1.84496 0.10528 0.81072 2.20430
std 0.51871 0.40265 0.87064 1.41090
min 0.00000 0.00000 0.00000 0.00000
25% 2.00000 0.00000 0.00000 1.00000
50% 2.00000 0.00000 1.00000 2.00000
75% 2.00000 0.00000 2.00000 3.00000
max 4.00000 10.00000 7.00000 17.00000
required_car_parking_space lead_time arrival_year arrival_month \
count 36275.00000 36275.00000 36275.00000 36275.00000
mean 0.03099 85.23256 2017.82043 7.42365
std 0.17328 85.93082 0.38384 3.06989
min 0.00000 0.00000 2017.00000 1.00000
25% 0.00000 17.00000 2018.00000 5.00000
50% 0.00000 57.00000 2018.00000 8.00000
75% 0.00000 126.00000 2018.00000 10.00000
max 1.00000 443.00000 2018.00000 12.00000
arrival_date repeated_guest no_of_previous_cancellations \
count 36275.00000 36275.00000 36275.00000
mean 15.59700 0.02564 0.02335
std 8.74045 0.15805 0.36833
min 1.00000 0.00000 0.00000
25% 8.00000 0.00000 0.00000
50% 16.00000 0.00000 0.00000
75% 23.00000 0.00000 0.00000
max 31.00000 1.00000 13.00000
no_of_previous_bookings_not_canceled avg_price_per_room \
count 36275.00000 36275.00000
mean 0.15341 103.42354
std 1.75417 35.08942
min 0.00000 0.00000
25% 0.00000 80.30000
50% 0.00000 99.45000
75% 0.00000 120.00000
max 58.00000 540.00000
no_of_special_requests
count 36275.00000
mean 0.61966
std 0.78624
min 0.00000
25% 0.00000
50% 0.00000
75% 1.00000
max 5.00000
# Check for duplicates
duplicate_rows = hotel.duplicated()
print("Number of duplicate rows:", duplicate_rows.sum())
# We don't need booking ID, so drop it
hotel.drop(columns=['Booking_ID'], inplace=True, errors='ignore')
Number of duplicate rows: 0
Exploratory Data Analysis (EDA)
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Leading Questions:
- What are the busiest months in the hotel?
- Which market segment do most of the guests come from?
- Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
- What percentage of bookings are canceled?
- Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
- Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?
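Before the full EDA below, questions like these reduce to one-liners on the booking frame. A minimal sketch of questions 1 and 4, using a tiny hypothetical sample in place of the real `hotel` DataFrame:

```python
import pandas as pd

# Tiny hypothetical sample standing in for the full `hotel` DataFrame
sample = pd.DataFrame({
    "booking_status": ["Canceled", "Not_Canceled", "Not_Canceled",
                       "Canceled", "Not_Canceled"],
    "arrival_month": [10, 10, 9, 8, 1],
})

# Q4: what fraction of bookings are canceled?
cancel_rate = (sample["booking_status"] == "Canceled").mean()
print(f"Cancellation rate: {cancel_rate:.1%}")

# Q1: busiest months, ranked by booking count
busiest = sample["arrival_month"].value_counts()
print(busiest.head(3))
```

The same `value_counts()` / boolean-mean patterns, run on the full data, produce the month ranking and the ~33% cancellation figure discussed below.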
Univariate Analysis 1
# Set styles
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 5)
# Define numerical and categorical columns
numerical_cols = hotel.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = hotel.select_dtypes(include=['object']).columns
# --- Numerical Features ---
for col in numerical_cols:
    fig, axs = plt.subplots(1, 2, figsize=(16, 4))
    # Histogram
    sns.histplot(hotel[col], kde=True, ax=axs[0], color='green')
    axs[0].set_title(f'Distribution of {col}')
    # Boxplot
    sns.boxplot(x=hotel[col].dropna(), ax=axs[1], color='blue')
    axs[1].set_title(f'Boxplot of {col}')
    plt.tight_layout()
    plt.show()
# --- Categorical Features ---
for col in categorical_cols:
    plt.figure(figsize=(10, 5))
    sns.countplot(x=col, data=hotel, palette='viridis', order=hotel[col].value_counts().index)
    plt.title(f'Count of {col}')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
Numerical Features
- lead_time
  - Visual: Histogram shows right skew; most values are low with a long tail.
  - Conclusion: Long lead times may signal higher cancellation risk.
- avg_price_per_room
  - Visual: Histogram shows a peak in the lower price range; some high-price outliers are visible in the boxplot.
  - Conclusion: Very high or low prices may influence cancellation behavior.
- no_of_previous_cancellations
  - Visual: Histogram and boxplot reveal most values at zero, with a few extreme cases.
  - Conclusion: Past cancellations are predictive of future cancellations.
- no_of_special_requests
  - Visual: Histogram shows a peak at 0–2; boxplot confirms low dispersion.
  - Conclusion: Guests with more requests may be more committed and less likely to cancel.
- required_car_parking_space
  - Visual: Count plot shows the majority of guests do not require parking (mostly 0s).
  - Conclusion: Guests requesting parking may be more committed.
- repeated_guest
  - Visual: Count plot shows dominance of non-repeated guests (value = 0).
  - Conclusion: Repeated guests are more loyal and less likely to cancel.
- arrival_month / arrival_date
  - Visual: Histogram likely shows seasonal trends, with spikes in summer or holidays.
  - Conclusion: Booking and cancellation behavior varies by season/month.

Categorical Features
- type_of_meal_plan
  - Visual: Count plot shows one meal plan (likely Meal Plan 1) dominates.
  - Conclusion: Guests selecting inclusive plans may be more committed.
- room_type_reserved
  - Visual: Count plot shows certain room types are reserved more frequently.
  - Conclusion: Certain room types may have different cancellation rates.
- market_segment_type
  - Visual: Market segment count plot shows dominance of online/offline bookings.
  - Conclusion: Online bookings may contribute more to cancellations.
- booking_status
  - Visual: Count plot reveals the class balance, i.e. the proportion of canceled vs. not canceled bookings.
  - Conclusion: We may need balanced metrics for modeling.
Bivariate Analysis 1
# Set visual style
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 5)
# Convert booking_status to binary for numeric analysis
hotel['booking_status_binary'] = hotel['booking_status'].map({'Not_Canceled': 0, 'Canceled': 1}).astype(int)
print(hotel['booking_status_binary'].value_counts(dropna=False))
# --- 1. Numerical Features vs Booking Status (Boxplots + KDE) ---
numerical_cols = hotel.select_dtypes(include=['int64', 'float64']).columns.drop('booking_status_binary')
for col in numerical_cols:
    try:
        plt.figure(figsize=(12, 5))
        # Boxplot
        plt.subplot(1, 2, 1)
        sns.boxplot(data=hotel, x='booking_status', y=col, palette='Set2')
        plt.title(f'{col} vs Booking Status (Boxplot)')
        # KDE
        plt.subplot(1, 2, 2)
        sns.kdeplot(data=hotel, x=col, hue='booking_status', fill=True)
        plt.title(f'{col} Distribution by Booking Status (KDE)')
        plt.tight_layout()
        plt.show()
    except Exception as e:
        print(f"Skipping {col} due to error: {e}")
# --- 2. Categorical Features vs Booking Status (Countplots) ---
categorical_cols = hotel.select_dtypes(include=['object']).columns
for col in categorical_cols:
    plt.figure(figsize=(10, 5))
    sns.countplot(data=hotel, x=col, hue='booking_status', palette='Set2',
                  order=hotel[col].value_counts().index)
    plt.title(f'{col} vs Booking Status')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
# --- 3. Correlation Heatmap ---
plt.figure(figsize=(10, 6))
corr = hotel.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()
booking_status_binary
0    24390
1    11885
Name: count, dtype: int64
Numerical Features vs. Booking Status
- lead_time
  - Boxplot/KDE: Guests with longer lead times are more likely to cancel.
  - Conclusion: Longer lead time is a strong positive predictor of cancellation.
- avg_price_per_room
  - Boxplot: Slightly higher room prices observed for canceled bookings.
  - Conclusion: Guests paying higher prices may be more price-sensitive and prone to cancel.
- no_of_previous_cancellations
  - Boxplot: Guests who canceled in the past are more likely to cancel again.
  - Conclusion: Strong repeat behavior; a key feature for the predictive model.
- no_of_special_requests
  - Boxplot/KDE: Guests with more special requests are much less likely to cancel.
  - Conclusion: Indicates higher intent to stay; useful for flagging high-commitment guests.
- required_car_parking_space
  - Boxplot: Guests requesting parking tend to cancel less.
  - Conclusion: Local or driving guests may be more reliable.
- repeated_guest
  - Boxplot: Repeated guests are strongly skewed toward not canceling.
  - Conclusion: Loyalty matters; this group is much more reliable.

Categorical Features vs. Booking Status
- type_of_meal_plan
  - Countplot: Bookings without a selected meal plan have a higher cancellation rate.
  - Conclusion: Meal selection may signal greater intent to stay.
- room_type_reserved
  - Countplot: Some room types have a higher cancellation rate than others.
  - Conclusion: Certain room types may attract more tentative bookings.
- market_segment_type
  - Countplot: Online and offline travel agents show higher cancellation proportions.
  - Conclusion: Market channel is a critical factor; OTA bookings are often less committed.

Correlation Heatmap
- High positive correlation with cancellations:
  - lead_time
  - no_of_previous_cancellations
- High negative correlation:
  - no_of_special_requests
  - repeated_guest
- Conclusion: These features are strong candidates for input into a machine learning model.
1. What are the busiest months in the hotel?
From the EDA distribution of arrival_month, the busiest months observed are October, followed closely by September and August.
The bar chart clearly shows a major spike in bookings during these three months, confirming the seasonal peak in late summer and early fall.
Conclusion: October, September, and August are the hotel's busiest months based on actual booking counts.
2. Which market segment do most of the guests come from?
Analysis of the market_segment_type distribution shows that the Online market segment overwhelmingly contributes the highest number of bookings.
The "Offline" segment follows distantly, and Corporate, Complementary, and Aviation segments contribute much less.
Conclusion: The Online market segment is the largest source of guests by a significant margin.
3. What are the differences in room prices in different market segments?
From the EDA of avg_price_per_room grouped by market_segment_type, Online guests pay the highest average room price (€112).
Aviation and Offline segments also pay relatively high prices (€91–100), while Complementary guests stay almost free (€3).
Conclusion: Room prices vary notably across segments, with Online customers being the most valuable by price paid.
4. What percentage of bookings are canceled?
Reviewing the booking_status distribution shows that approximately 32.76% of bookings are marked as "Canceled."
This is visualized in the cancellation pie chart, where about one-third of all bookings do not materialize.
Conclusion: Around one-third of all bookings end up being canceled.
5. What percentage of repeating guests cancel?
From the EDA subset where repeated_guest == 1, the cancellation rate among repeating guests is extremely low, at 1.72%.
Repeating guests show a remarkably high loyalty compared to new guests.
Conclusion: Repeating guests are very reliable and have a much lower risk of cancellation.
6. Do special requests affect booking cancellation?
By analyzing no_of_special_requests vs. booking_status, we observe a clear trend:
- Guests with 0 special requests have the highest cancellation rate (~43%).
- Guests with 1 or 2 special requests cancel much less frequently.
- Guests with 3 or more special requests show no cancellations in the dataset.
Conclusion: Guests making special requests are far less likely to cancel; the number of special requests can be read as a commitment indicator.
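The trend above can be quantified directly by grouping on no_of_special_requests and averaging a canceled indicator. A minimal sketch of the pattern, using a tiny hypothetical sample in place of the full `hotel` frame:

```python
import pandas as pd

# Hypothetical mini-sample; in the notebook this runs on the full `hotel` DataFrame
sample = pd.DataFrame({
    "no_of_special_requests": [0, 0, 0, 1, 1, 2, 3],
    "booking_status": ["Canceled", "Canceled", "Not_Canceled", "Canceled",
                       "Not_Canceled", "Not_Canceled", "Not_Canceled"],
})

# Cancellation rate per number of special requests
rate_by_requests = (
    sample.assign(canceled=sample["booking_status"].eq("Canceled"))
          .groupby("no_of_special_requests")["canceled"]
          .mean()
)
print(rate_by_requests)
```

On the real data, the same pattern yields the ~43% rate for zero-request bookings quoted above.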
Data Preprocessing
- Missing value treatment (if needed)
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling
- Any other preprocessing steps (if needed)
# ------------------------------
# Missing Value Treatment
# ------------------------------
missing = hotel.isnull().sum()
print("Missing values:\n", missing[missing > 0]) # No missing values as confirmed earlier
# -----------------------------------
# Create a copy of the dataframe for the logistic regression, and another for the decision tree
# -----------------------------------
hotel_lr = hotel.copy()
hotel_dt = hotel.copy()
# -----------------------------------
# SHARED
# -----------------------------------
# Map the binary target: 1 = Not_Canceled, 0 = Canceled
# (note: this is the reverse of the EDA encoding above, so positive model
# coefficients correspond to a higher chance of the booking NOT being canceled)
target_mapping = {'Canceled': 0, 'Not_Canceled': 1}
hotel_lr['booking_status_binary'] = hotel_lr['booking_status'].map(target_mapping)
hotel_dt['booking_status_binary'] = hotel_dt['booking_status'].map(target_mapping)
# Feature Engineering (shared): total length of stay
# (loop variable renamed from `df` to avoid shadowing the raw-data frame)
for frame in [hotel_lr, hotel_dt]:
    frame['total_nights'] = frame['no_of_week_nights'] + frame['no_of_weekend_nights']
# -----------------------------------
# hotel_lr: Preprocessing for Logistic Regression
# -----------------------------------
# Drop original target
hotel_lr.drop(columns=['booking_status'], inplace=True)
# One-hot encode all categorical predictors
categorical_cols_lr = hotel_lr.select_dtypes(include='object').columns
hotel_lr = pd.get_dummies(hotel_lr, columns=categorical_cols_lr, drop_first=True)
# Outlier Treatment for Logistic Regression
# Exclude columns that are binary, low-range, or identifiers
excluded_from_iqr = [
'no_of_adults', 'no_of_children', 'required_car_parking_space',
'arrival_year', 'repeated_guest', 'no_of_previous_cancellations',
'no_of_previous_bookings_not_canceled'
]
iqr_cols = hotel_lr.select_dtypes(include=['int64', 'float64']).columns.difference(
['booking_status_binary'] + excluded_from_iqr
)
for col in iqr_cols:
    Q1 = hotel_lr[col].quantile(0.25)
    Q3 = hotel_lr[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    # Clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] to the bounds
    hotel_lr[col] = np.where(hotel_lr[col] < lower, lower,
                             np.where(hotel_lr[col] > upper, upper, hotel_lr[col]))
# Feature scaling
scaler = StandardScaler()
X_lr = hotel_lr.drop(columns=['booking_status_binary'])
X_lr_scaled = pd.DataFrame(scaler.fit_transform(X_lr), columns=X_lr.columns)
y_lr = hotel_lr['booking_status_binary']
# Split data
X_train_lr, X_test_lr, y_train_lr, y_test_lr = train_test_split(
X_lr_scaled, y_lr, test_size=0.2, random_state=42, stratify=y_lr
)
run_diagnostics(hotel_lr)
Missing values:
 Series([], dtype: int64)

Diagnostic check on DataFrame
All-NaN columns: []
Constant columns (no variance): []
Columns with high NaN ratio (> 20%): []

NaN counts by column:
Series([], dtype: int64)

Data type summary:
bool       13
int64       8
float64     8
Name: count, dtype: int64

Diagnostics complete.
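As a self-contained sanity check of the scaling step (synthetic numbers, not the hotel data): StandardScaler transforms each column to zero mean and unit variance, which puts the logistic regression coefficients on a comparable per-standard-deviation scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[60.0], [80.0], [100.0], [120.0]])  # e.g. a column of room prices
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.ravel())
print(X_scaled.mean(), X_scaled.std())  # ~0 and ~1 after scaling
```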
# -----------------------------------
# hotel_dt: Preprocessing for Decision Tree
# -----------------------------------
# Convert object columns to 'category' dtype: this reduces memory usage and
# makes categorical features easier to identify for preprocessing and visualization.
for col in hotel.select_dtypes(include='object').columns:
    print(f"Converting {col} to categorical.")
    hotel[col] = hotel[col].astype('category')
# Verify the conversion
print(hotel.info())
run_diagnostics(hotel)
# Define ordinal mappings for tree model
ordinal_mappings = {
'type_of_meal_plan': {
'Not Selected': 0, 'Meal Plan 1': 1, 'Meal Plan 2': 2, 'Meal Plan 3': 3
},
'room_type_reserved': {
'Room_Type 1': 1, 'Room_Type 2': 2, 'Room_Type 3': 3,
'Room_Type 4': 4, 'Room_Type 5': 5, 'Room_Type 6': 6, 'Room_Type 7': 7
},
'market_segment_type': {
'Complementary': 0, 'Offline': 1, 'Online': 2, 'Corporate': 3, 'Aviation': 4
}
}
# Apply ordinal mapping for decision tree
for col, mapping in ordinal_mappings.items():
    if col in hotel_dt.columns:
        hotel_dt[col] = hotel_dt[col].map(mapping).fillna(-1).astype(int)
# Drop original target
hotel_dt.drop(columns=['booking_status'], inplace=True)
# One-hot encode any remaining nominal features (if any)
ordinal_cols = list(ordinal_mappings.keys())
nominal_cols_dt = hotel_dt.select_dtypes(include='object').columns.difference(ordinal_cols)
hotel_dt = pd.get_dummies(hotel_dt, columns=nominal_cols_dt, drop_first=True)
# No scaling or outlier treatment for trees
X_dt = hotel_dt.drop(columns=['booking_status_binary'])
y_dt = hotel_dt['booking_status_binary']
# Split data
X_train_dt, X_test_dt, y_train_dt, y_test_dt = train_test_split(
X_dt, y_dt, test_size=0.2, random_state=42, stratify=y_dt
)
run_diagnostics(hotel_dt)
print("Preprocessing complete for both Logistic Regression and Decision Tree models.")
Converting type_of_meal_plan to categorical.
Converting room_type_reserved to categorical.
Converting market_segment_type to categorical.
Converting booking_status to categorical.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   no_of_adults                          36275 non-null  int64
 1   no_of_children                        36275 non-null  int64
 2   no_of_weekend_nights                  36275 non-null  int64
 3   no_of_week_nights                     36275 non-null  int64
 4   type_of_meal_plan                     36275 non-null  category
 5   required_car_parking_space            36275 non-null  int64
 6   room_type_reserved                    36275 non-null  category
 7   lead_time                             36275 non-null  int64
 8   arrival_year                          36275 non-null  int64
 9   arrival_month                         36275 non-null  int64
 10  arrival_date                          36275 non-null  int64
 11  market_segment_type                   36275 non-null  category
 12  repeated_guest                        36275 non-null  int64
 13  no_of_previous_cancellations          36275 non-null  int64
 14  no_of_previous_bookings_not_canceled  36275 non-null  int64
 15  avg_price_per_room                    36275 non-null  float64
 16  no_of_special_requests                36275 non-null  int64
 17  booking_status                        36275 non-null  category
 18  booking_status_binary                 36275 non-null  int64
dtypes: category(4), float64(1), int64(14)
memory usage: 4.3 MB
None

Diagnostic check on DataFrame
All-NaN columns: []
Constant columns (no variance): []
Columns with high NaN ratio (> 20%): []

NaN counts by column:
Series([], dtype: int64)

Data type summary:
int64       14
category     1
category     1
category     1
float64      1
category     1
Name: count, dtype: int64

Diagnostics complete.

Diagnostic check on DataFrame
All-NaN columns: []
Constant columns (no variance): []
Columns with high NaN ratio (> 20%): []

NaN counts by column:
Series([], dtype: int64)

Data type summary:
int64      18
float64     1
Name: count, dtype: int64

Diagnostics complete.

Preprocessing complete for both Logistic Regression and Decision Tree models.
Checking Logistic Regression Assumptions
Linearity of Log Odds:
For continuous variables, we assume a linear relationship between predictors and the log odds of the outcome. Although not formally tested here, the impact of numeric predictors (e.g., lead_time, avg_price_per_room) appears reasonable based on coefficient signs.
No Perfect Multicollinearity:
Variance Inflation Factor (VIF) was checked, and features with high multicollinearity were removed or adjusted as necessary. The remaining features have acceptable VIF scores (below the threshold of 5).
Independence of Observations:
Each booking record is assumed to be independent of the others (no repeated measures).
These assumptions are reasonably well met, so we proceed with Logistic Regression modeling.
Functions to help us evaluate the decision tree models we create
## Function to create and plot a confusion matrix
def make_confusion_matrix(y_pred, y_test, labels=[1, 0]):
    # Row/column order follows labels=[0, 1]: 0 = Canceled, 1 = Not_Canceled
    cm = metrics.confusion_matrix(y_test, y_pred, labels=[0, 1])
    df_cm = pd.DataFrame(cm,
                         index=["Actual - Canceled", "Actual - Not_Canceled"],
                         columns=["Predicted - Canceled", "Predicted - Not_Canceled"])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    cell_labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    cell_labels = np.asarray(cell_labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=cell_labels, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.title("Confusion Matrix")
def evaluate_model_performance(y_train, y_train_pred, y_test, y_test_pred):
    metrics_dict = {
        'Precision': [
            precision_score(y_train, y_train_pred, average='binary'),
            precision_score(y_test, y_test_pred, average='binary')
        ],
        'Accuracy': [
            accuracy_score(y_train, y_train_pred),
            accuracy_score(y_test, y_test_pred)
        ],
        'Recall': [
            recall_score(y_train, y_train_pred, average='binary'),
            recall_score(y_test, y_test_pred, average='binary')
        ],
        'F1 Score': [
            f1_score(y_train, y_train_pred, average='binary'),
            f1_score(y_test, y_test_pred, average='binary')
        ]
    }
    results_df = pd.DataFrame(metrics_dict, index=['Train', 'Test']).T
    print("\nModel Performance Summary:")
    display(results_df.style.format("{:.4f}"))
def show_decision_tree(model, X_train):
    plt.figure(figsize=(20, 10))
    # class_names must follow ascending class order: 0 = Canceled, 1 = Not_Canceled
    tree.plot_tree(model, filled=True, feature_names=X_train.columns,
                   class_names=['Canceled', 'Not_Canceled'], fontsize=8)
    plt.title("Decision Tree")
    plt.show()
# Visualize the importance of each feature
def feature_visualization(model, X):
    importances = model.feature_importances_
    indices = np.argsort(importances)
    feature_names = list(X.columns)
    plt.figure(figsize=(12, 12))
    plt.title('Feature Importances')
    # Draw bars
    bars = plt.barh(range(len(indices)), importances[indices],
                    color='violet', align='center')
    # Label each bar with its importance value
    for i, bar in enumerate(bars):
        width = bar.get_width()
        plt.text(width + 0.005,                       # X position (right of bar)
                 bar.get_y() + bar.get_height() / 2,  # Y position (centered)
                 f"{width:.2f}",                      # Value label
                 va='center')
    # Y-axis labels and formatting
    plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
    plt.xlabel('Relative Importance')
    plt.tight_layout()
    plt.show()
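A minimal, self-contained check of the attribute feature_visualization relies on: with a synthetic target driven only by the first column, feature_importances_ concentrates on that column and the importances sum to 1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the target depends only on the first of three features
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
print(clf.feature_importances_)  # first entry dominates; entries sum to 1
```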
Building a Logistic Regression model
# Fit a logistic regression model on our hotel_lr data.
# Note: sm.Logit does not add an intercept automatically; add_constant is not
# applied here, so the model is fit without a constant term.
logit_model = sm.Logit(y_train_lr, X_train_lr)
logit_result = logit_model.fit()
print(logit_result.summary())
Optimization terminated successfully.
Current function value: 0.511048
Iterations 6
Logit Regression Results
=================================================================================
Dep. Variable: booking_status_binary No. Observations: 29020
Model: Logit Df Residuals: 28992
Method: MLE Df Model: 27
Date: Sun, 27 Apr 2025 Pseudo R-squ.: 0.1920
Time: 17:10:28 Log-Likelihood: -14831.
converged: True LL-Null: -18355.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
no_of_adults -0.0376 0.016 -2.296 0.022 -0.070 -0.006
no_of_children -0.0648 0.020 -3.266 0.001 -0.104 -0.026
no_of_weekend_nights -0.5702 0.060 -9.504 0.000 -0.688 -0.453
no_of_week_nights -0.7538 0.090 -8.402 0.000 -0.930 -0.578
required_car_parking_space 0.1809 0.016 11.347 0.000 0.150 0.212
lead_time -1.2528 0.020 -62.261 0.000 -1.292 -1.213
arrival_year -0.1384 0.017 -8.007 0.000 -0.172 -0.105
arrival_month 0.0450 0.016 2.843 0.004 0.014 0.076
arrival_date -0.0322 0.014 -2.254 0.024 -0.060 -0.004
repeated_guest 0.0185 0.021 0.871 0.384 -0.023 0.060
no_of_previous_cancellations -0.0027 0.016 -0.164 0.870 -0.035 0.029
no_of_previous_bookings_not_canceled -0.0391 0.021 -1.878 0.060 -0.080 0.002
avg_price_per_room -0.4398 0.020 -22.023 0.000 -0.479 -0.401
no_of_special_requests 0.8954 0.018 50.314 0.000 0.861 0.930
total_nights 0.9532 0.111 8.577 0.000 0.735 1.171
type_of_meal_plan_Meal Plan 2 -0.0618 0.017 -3.739 0.000 -0.094 -0.029
type_of_meal_plan_Meal Plan 3 -0.0125 0.013 -0.978 0.328 -0.038 0.013
type_of_meal_plan_Not Selected -0.0366 0.016 -2.340 0.019 -0.067 -0.006
room_type_reserved_Room_Type 2 0.0454 0.016 2.926 0.003 0.015 0.076
room_type_reserved_Room_Type 3 0.0015 0.016 0.100 0.921 -0.029 0.032
room_type_reserved_Room_Type 4 0.0695 0.017 4.184 0.000 0.037 0.102
room_type_reserved_Room_Type 5 0.0490 0.014 3.457 0.001 0.021 0.077
room_type_reserved_Room_Type 6 0.0738 0.020 3.666 0.000 0.034 0.113
room_type_reserved_Room_Type 7 0.0249 0.016 1.595 0.111 -0.006 0.056
market_segment_type_Complementary -0.0513 0.029 -1.798 0.072 -0.107 0.005
market_segment_type_Corporate 0.0983 0.051 1.925 0.054 -0.002 0.198
market_segment_type_Offline 0.5495 0.099 5.532 0.000 0.355 0.744
market_segment_type_Online -0.1466 0.104 -1.406 0.160 -0.351 0.058
========================================================================================================
Model performance evaluation
def view_lr_results(y_true, y_pred, set_name="Training Set"):
"""
Displays confusion matrix and accuracy score for logistic regression results.
Parameters:
y_true (array-like): Ground truth values.
y_pred (array-like): Predicted probabilities or binary predictions.
set_name (str): Label for which dataset is being evaluated (e.g., 'Training Set', 'Test Set').
"""
# Round predictions if they are probabilities
y_pred_binary = y_pred.round()
# Confusion matrix
cm = confusion_matrix(y_true, y_pred_binary)
    print(f"Confusion Matrix ({set_name}):\n", cm)
# Heatmap
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g", cmap="Blues")
plt.title(f"Confusion Matrix ({set_name})")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
# Accuracy
accuracy = accuracy_score(y_true, y_pred_binary)
    print(f"Accuracy on {set_name}: {accuracy:.4f}")
# predicting on training set
y_pred_train_lr = logit_result.predict(X_train_lr)
view_lr_results(y_train_lr, y_pred_train_lr, "Training Set")
# predicting on test set
y_pred_test_lr = logit_result.predict(X_test_lr)
view_lr_results(y_test_lr, y_pred_test_lr, "Test Set")
Confusion Matrix (Training Set):
 [[ 7907  1601]
 [ 5375 14137]]
Accuracy on Training Set: 0.7596
Confusion Matrix (Test Set):
 [[1979  398]
 [1315 3563]]
Accuracy on Test Set: 0.7639
Checking Multicollinearity
- In order to make statistical inferences from a logistic regression model, it is important to ensure that there is no multicollinearity present in the data.
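For reference, the variance inflation factor computed below regresses each feature on all the others; for feature $j$,

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

where $R_j^2$ is the $R^2$ of that regression. A VIF above roughly 5 (the threshold used in the code below) flags problematic collinearity; a VIF of 5 means the coefficient's variance is inflated five-fold relative to uncorrelated features.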
# Multicollinearity matters for coefficient inference in Logistic Regression (tree models are unaffected)
# Correlation matrix (for numeric features)
plt.figure(figsize=(14, 10))
corr_matrix = X_lr.corr()
# Visualize with a heatmap
sns.heatmap(corr_matrix, annot=True, fmt=".1f", cmap='coolwarm', square=True)
plt.title('Correlation Heatmap')
plt.show()
# -----------------------------------
# Calculate VIF for each feature
# -----------------------------------
def calculate_vif(X):
"""Returns a DataFrame with VIF scores for each feature."""
X_with_const = add_constant(X, has_constant='add')
vif = pd.DataFrame()
vif["Feature"] = X_with_const.columns
vif["VIF"] = [variance_inflation_factor(X_with_const.values, i)
for i in range(X_with_const.shape[1])]
return vif.drop(index=0) # Drop constant term
new_X_lr = X_lr.copy().astype(float)
vif_data = calculate_vif(new_X_lr)
print('Initial VIF scores:')
print(vif_data.sort_values(by='VIF', ascending=False))
while True:
    # Stop once no remaining feature exceeds the VIF threshold
    if vif_data[vif_data["VIF"] > 5].empty:
        print("Treated VIF scores:")
        print(vif_data.sort_values(by='VIF', ascending=False))
        break
    # Otherwise drop the feature with the highest VIF and recompute
    max_vif_feature = vif_data.sort_values("VIF", ascending=False).iloc[0]
    feature_to_drop = max_vif_feature["Feature"]
    print(f"Dropping '{feature_to_drop}' with VIF = {max_vif_feature['VIF']:.2f}")
    new_X_lr.drop(columns=[feature_to_drop], inplace=True)
    vif_data = calculate_vif(new_X_lr)
# Fit logistic regression with remaining features
X_train_lr_2, X_test_lr_2, y_train_lr_2, y_test_lr_2 = train_test_split(
new_X_lr, y_lr, test_size=0.2, random_state=42, stratify=y_lr
)
lg_result_2 = sm.Logit(y_train_lr_2, add_constant(X_train_lr_2)).fit(
method='bfgs', maxiter=100)
print(lg_result_2.summary())
# predicting on training set
y_pred_train_lr_2 = lg_result_2.predict(add_constant(X_train_lr_2))
view_lr_results(y_train_lr_2, y_pred_train_lr_2, "Training Set")
# predicting on test set
y_pred_test_lr_2 = lg_result_2.predict(add_constant(X_test_lr_2))
view_lr_results(y_test_lr_2, y_pred_test_lr_2, "Test Set")
Initial VIF scores:
Feature VIF
28 market_segment_type_Online 69.89901
27 market_segment_type_Offline 63.00283
15 total_nights 42.32040
4 no_of_week_nights 27.06096
26 market_segment_type_Corporate 16.63185
3 no_of_weekend_nights 11.87819
25 market_segment_type_Complementary 4.36189
2 no_of_children 1.99312
13 avg_price_per_room 1.95197
23 room_type_reserved_Room_Type 6 1.94818
10 repeated_guest 1.76396
12 no_of_previous_bookings_not_canceled 1.61421
7 arrival_year 1.42654
6 lead_time 1.38823
21 room_type_reserved_Room_Type 4 1.37230
11 no_of_previous_cancellations 1.35200
1 no_of_adults 1.35003
18 type_of_meal_plan_Not Selected 1.28428
8 arrival_month 1.27242
14 no_of_special_requests 1.26235
16 type_of_meal_plan_Meal Plan 2 1.25502
19 room_type_reserved_Room_Type 2 1.09490
24 room_type_reserved_Room_Type 7 1.08962
5 required_car_parking_space 1.03643
22 room_type_reserved_Room_Type 5 1.02982
17 type_of_meal_plan_Meal Plan 3 1.01801
9 arrival_date 1.00694
20 room_type_reserved_Room_Type 3 1.00210
Dropping 'market_segment_type_Online' with VIF = 69.90
Dropping 'total_nights' with VIF = 42.00
Treated VIF scores:
Feature VIF
2 no_of_children 1.99229
13 avg_price_per_room 1.94937
22 room_type_reserved_Room_Type 6 1.94788
10 repeated_guest 1.76006
26 market_segment_type_Offline 1.62843
12 no_of_previous_bookings_not_canceled 1.61379
25 market_segment_type_Corporate 1.53064
7 arrival_year 1.42370
6 lead_time 1.38205
20 room_type_reserved_Room_Type 4 1.36681
11 no_of_previous_cancellations 1.35185
1 no_of_adults 1.32768
24 market_segment_type_Complementary 1.28551
17 type_of_meal_plan_Not Selected 1.27938
8 arrival_month 1.27117
14 no_of_special_requests 1.25620
15 type_of_meal_plan_Meal Plan 2 1.25397
4 no_of_week_nights 1.09653
18 room_type_reserved_Room_Type 2 1.09448
23 room_type_reserved_Room_Type 7 1.08951
3 no_of_weekend_nights 1.05454
5 required_car_parking_space 1.03630
21 room_type_reserved_Room_Type 5 1.02979
16 type_of_meal_plan_Meal Plan 3 1.01801
9 arrival_date 1.00661
19 room_type_reserved_Room_Type 3 1.00209
Current function value: 0.428039
Iterations: 100
Function evaluations: 109
Gradient evaluations: 104
Logit Regression Results
=================================================================================
Dep. Variable: booking_status_binary No. Observations: 29020
Model: Logit Df Residuals: 28993
Method: MLE Df Model: 26
Date: Sun, 27 Apr 2025 Pseudo R-squ.: 0.3232
Time: 17:10:40 Log-Likelihood: -12422.
converged: False LL-Null: -18355.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const 0.0139 109.057 0.000 1.000 -213.734 213.762
no_of_adults -0.0383 0.035 -1.098 0.272 -0.107 0.030
no_of_children -0.2018 0.054 -3.743 0.000 -0.307 -0.096
no_of_weekend_nights -0.1435 0.018 -7.824 0.000 -0.179 -0.108
no_of_week_nights -0.0049 0.013 -0.385 0.700 -0.030 0.020
required_car_parking_space 1.7005 0.132 12.879 0.000 1.442 1.959
lead_time -0.0167 0.000 -64.998 0.000 -0.017 -0.016
arrival_year 0.0014 0.054 0.026 0.979 -0.105 0.107
arrival_month 0.0602 0.006 9.829 0.000 0.048 0.072
arrival_date -0.0022 0.002 -1.236 0.216 -0.006 0.001
repeated_guest 1.7481 0.416 4.199 0.000 0.932 2.564
no_of_previous_cancellations -0.2583 0.068 -3.818 0.000 -0.391 -0.126
no_of_previous_bookings_not_canceled 0.1344 0.100 1.344 0.179 -0.062 0.330
avg_price_per_room -0.0207 0.001 -28.258 0.000 -0.022 -0.019
no_of_special_requests 1.4884 0.028 52.852 0.000 1.433 1.544
type_of_meal_plan_Meal Plan 2 -0.0859 0.061 -1.418 0.156 -0.205 0.033
type_of_meal_plan_Meal Plan 3 -0.1058 2.071 -0.051 0.959 -4.165 3.954
type_of_meal_plan_Not Selected -0.2593 0.050 -5.212 0.000 -0.357 -0.162
room_type_reserved_Room_Type 2 0.4860 0.125 3.882 0.000 0.241 0.731
room_type_reserved_Room_Type 3 0.0252 1.324 0.019 0.985 -2.570 2.620
room_type_reserved_Room_Type 4 0.2536 0.050 5.087 0.000 0.156 0.351
room_type_reserved_Room_Type 5 0.9052 0.193 4.693 0.000 0.527 1.283
room_type_reserved_Room_Type 6 0.8829 0.137 6.459 0.000 0.615 1.151
room_type_reserved_Room_Type 7 0.9886 0.290 3.415 0.001 0.421 1.556
market_segment_type_Complementary 1.3980 0.585 2.391 0.017 0.252 2.544
market_segment_type_Corporate 0.8613 0.096 9.009 0.000 0.674 1.049
market_segment_type_Offline 1.8250 0.049 37.242 0.000 1.729 1.921
========================================================================================================
Confusion Matrix (Training Set):
[[ 5926 3582]
[ 2214 17298]]
Accuracy on Training Set: 0.8003
Confusion Matrix (Test Set):
 [[1506  871]
 [ 523 4355]]
Accuracy on Test Set: 0.8079
# Treat high p-values
def refine_logit_by_pval(model_result, X_train, X_test, y_train, y_test,
threshold=0.05, iteration=1):
# Extract p-values (excluding constant)
pvalues = model_result.pvalues.drop("const", errors="ignore")
# Base case: all p-values are below the threshold
if all(pvalues <= threshold):
        print("All p-values <= threshold.")
        return model_result, X_train, X_test
# Identify and drop feature with highest p-value
feature_to_drop = pvalues.idxmax()
    print(f"Dropping '{feature_to_drop}' with p-value = {pvalues.max():.4f}")
# Drop from training and test sets
X_train = X_train.drop(columns=[feature_to_drop])
X_test = X_test.drop(columns=[feature_to_drop])
# Refit model with updated features
X_train_const = add_constant(X_train, has_constant='add')
new_model = sm.Logit(y_train, X_train_const).fit(method='bfgs', maxiter=100, disp=False)
# Recurse
return refine_logit_by_pval(new_model, X_train, X_test, y_train, y_test, threshold, iteration + 1)
# Run refinement
lg_result_3, X_train_lr_3, X_test_lr_3 = refine_logit_by_pval(
lg_result_2,
X_train_lr_2,
X_test_lr_2,
y_train_lr_2,
y_test_lr_2
)
# Add constant columns
X_train_const_3 = add_constant(X_train_lr_3, has_constant='add')
X_test_const_3 = add_constant(X_test_lr_3, has_constant='add')
# Predict
y_pred_train_3 = lg_result_3.predict(X_train_const_3).round()
y_pred_test_3 = lg_result_3.predict(X_test_const_3).round()
# Accuracy scores
acc_train_3 = accuracy_score(y_train_lr_2, y_pred_train_3)
acc_test_3 = accuracy_score(y_test_lr_2, y_pred_test_3)
# Print results
print(f"Training Accuracy: {acc_train_3:.4f}")
print(f"Test Accuracy: {acc_test_3:.4f}")
# Score our model
evaluate_model_performance(y_train_lr_2, y_pred_train_3, y_test_lr_2, y_pred_test_3)
make_confusion_matrix(y_pred_test_3, y_test_lr_2)
print(lg_result_3.summary())
Dropping 'room_type_reserved_Room_Type 3' with p-value = 0.9848
Dropping 'arrival_year' with p-value = 0.9792
Dropping 'type_of_meal_plan_Meal Plan 3' with p-value = 0.9603
Dropping 'no_of_week_nights' with p-value = 0.5883
Dropping 'no_of_adults' with p-value = 0.2665
Dropping 'no_of_previous_bookings_not_canceled' with p-value = 0.2691
Dropping 'arrival_date' with p-value = 0.2043
Dropping 'type_of_meal_plan_Meal Plan 2' with p-value = 0.0548
All p-values <= threshold.
Training Accuracy: 0.7991
Test Accuracy: 0.8065
Model Performance Summary:
| Metric | Train | Test |
|---|---|---|
| Precision | 0.8278 | 0.8328 |
| Accuracy | 0.7991 | 0.8065 |
| Recall | 0.8854 | 0.8911 |
| F1 Score | 0.8556 | 0.8610 |
Logit Regression Results
=================================================================================
Dep. Variable: booking_status_binary No. Observations: 29020
Model: Logit Df Residuals: 29001
Method: MLE Df Model: 18
Date: Sun, 27 Apr 2025 Pseudo R-squ.: 0.3232
Time: 17:10:46 Log-Likelihood: -12423.
converged: False LL-Null: -18355.
Covariance Type: nonrobust LLR p-value: 0.000
=====================================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------------
const 2.7658 0.087 31.677 0.000 2.595 2.937
no_of_children -0.1698 0.054 -3.156 0.002 -0.275 -0.064
no_of_weekend_nights -0.1438 0.018 -7.897 0.000 -0.180 -0.108
required_car_parking_space 1.7537 0.134 13.113 0.000 1.492 2.016
lead_time -0.0168 0.000 -69.722 0.000 -0.017 -0.016
arrival_month 0.0610 0.006 10.899 0.000 0.050 0.072
repeated_guest 2.2616 0.415 5.445 0.000 1.448 3.076
no_of_previous_cancellations -0.2063 0.065 -3.173 0.002 -0.334 -0.079
avg_price_per_room -0.0209 0.001 -30.606 0.000 -0.022 -0.020
no_of_special_requests 1.4821 0.028 53.164 0.000 1.427 1.537
type_of_meal_plan_Not Selected -0.2607 0.049 -5.347 0.000 -0.356 -0.165
room_type_reserved_Room_Type 2 0.4379 0.125 3.511 0.000 0.193 0.682
room_type_reserved_Room_Type 4 0.2379 0.048 4.947 0.000 0.144 0.332
room_type_reserved_Room_Type 5 0.7955 0.190 4.179 0.000 0.422 1.169
room_type_reserved_Room_Type 6 0.8302 0.136 6.105 0.000 0.564 1.097
room_type_reserved_Room_Type 7 0.9189 0.287 3.203 0.001 0.357 1.481
market_segment_type_Complementary 1.4353 0.593 2.419 0.016 0.272 2.598
market_segment_type_Corporate 0.9317 0.095 9.770 0.000 0.745 1.119
market_segment_type_Offline 1.8076 0.047 38.704 0.000 1.716 1.899
=====================================================================================================
The model's accuracy did not drop significantly on either the training or the test data after removing the high p-value features. All remaining columns are significant predictors; let's check the model's performance and interpret the coefficients.
Final Model Summary
Note on interpretation: with this target encoding, the positive class (booking_status_binary = 1) is a booking that is not canceled; class 1 makes up 19512 / 29020 = 67.24% of the training set, matching the EDA finding that 32.76% of bookings are canceled. A positive coefficient therefore lowers cancellation risk, and a negative coefficient raises it. Each percentage below is exp(coef) - 1, the change in the odds of the booking being honored per 1-unit increase in the feature.
Features that raise cancellation risk (negative coefficients):
- no_of_children: odds of honoring the booking fall 15.6% per child; larger parties may face more scheduling conflicts.
- no_of_weekend_nights: fall 13.4% per weekend night.
- lead_time: fall 1.7% per day. Small per day, but it compounds quickly, making long-lead bookings a major risk factor.
- avg_price_per_room: fall 2.1% per euro; guests paying more are likelier to cancel, perhaps after finding a better rate.
- type_of_meal_plan_Not Selected: fall 23.0%; skipping the meal plan may signal a less committed booking.
- no_of_previous_cancellations: fall 18.6%; past cancellation behavior predicts future cancellations.
Features that lower cancellation risk (positive coefficients):
- required_car_parking_space: odds of honoring the booking rise 477.6%; guests who arrange parking have concrete travel plans.
- repeated_guest: rise 859.8%, consistent with the EDA finding that repeat guests cancel only 1.72% of the time.
- no_of_special_requests: rise 340.2% per request; guests who invest effort in their booking rarely cancel.
- Room types (relative to the baseline room type):
  - room_type_reserved_Room_Type 2: rise 54.9%
  - room_type_reserved_Room_Type 4: rise 26.9%
  - room_type_reserved_Room_Type 5: rise 121.6%
  - room_type_reserved_Room_Type 6: rise 129.4%
  - room_type_reserved_Room_Type 7: rise 150.6%
- Market segments:
  - market_segment_type_Complementary: rise 320.1%
  - market_segment_type_Corporate: rise 153.9%
  - market_segment_type_Offline: rise 509.6%
  - These segments cancel far less often than the omitted baseline, which (with the Online dummy removed during VIF treatment) consists mostly of Online bookings.
Summary Takeaways
- Decreased risk: repeat guests, special requests, parking requests, and Offline/Corporate bookings.
- Increased risk: long lead times, higher room prices, Online bookings, and guests with prior cancellations.
- Operational insight: lead time, booking channel, and special-request patterns are the clearest levers for cancellation and refund policies.
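The percentage effects quoted above are just exponentiated coefficients. A minimal sketch of the conversion, using coefficient values copied from the fitted summary table above:

```python
import math

def odds_change_pct(coef):
    """Percent change in the odds per 1-unit feature increase."""
    return (math.exp(coef) - 1) * 100

# Coefficients from the summary table above
print(f"no_of_children: {odds_change_pct(-0.1698):+.1f}%")             # -15.6%
print(f"required_car_parking_space: {odds_change_pct(1.7537):+.1f}%")  # +477.6%
print(f"no_of_special_requests: {odds_change_pct(1.4821):+.1f}%")      # +340.2%
```

Note that this reads effects on the *odds*, not the probability; for rare or very common outcomes the two can differ substantially.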
Building a Decision Tree model
# Initialize the model
dt_model = DecisionTreeClassifier(
criterion='gini',
max_depth=None,
min_samples_split=2,
random_state=42
)
# Train the model
dt_model.fit(X_train_dt, y_train_dt)
# Score our model
y_test_pred = dt_model.predict(X_test_dt)
y_train_pred = dt_model.predict(X_train_dt)
evaluate_model_performance(y_train_dt, y_train_pred, y_test_dt, y_test_pred)
make_confusion_matrix(y_test_pred,y_test_dt)
# Visualize the tree
show_decision_tree(dt_model, X_train_dt)
Model Performance Summary:
| Metric | Train | Test |
|---|---|---|
| Precision | 0.9962 | 0.9100 |
| Accuracy | 0.9942 | 0.8754 |
| Recall | 0.9952 | 0.9041 |
| F1 Score | 0.9957 | 0.9070 |
We can see several things from our model performance summary. While the scores on the test set are quite high, they fall well short of the training scores, which means our model is overfit (as expected).
Additionally, we need to prune the tree as it is much too large to interpret.
This model needs to be further refined.
feature_visualization(dt_model, X_dt)
Do we need to prune the tree?
Let's try pre-pruning to see the effects
dt_model_1 = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=1)
dt_model_1.fit(X_train_dt, y_train_dt)
# Evaluate the model (predict with the pruned model, dt_model_1)
y_test_pred_1 = dt_model_1.predict(X_test_dt)
y_train_pred_1 = dt_model_1.predict(X_train_dt)
evaluate_model_performance(y_train_dt, y_train_pred_1, y_test_dt, y_test_pred_1)
make_confusion_matrix(y_test_pred_1,y_test_dt)
# Visualize the tree
show_decision_tree(dt_model_1, X_train_dt)
Model Performance Summary:
| Metric | Train | Test |
|---|---|---|
| Precision | 0.9962 | 0.7730 |
| Accuracy | 0.9942 | 0.7822 |
| Recall | 0.9952 | 0.9572 |
| F1 Score | 0.9957 | 0.8553 |
This tree is much easier to visualize. However, while recall is very high on the test set, precision and accuracy are lower than we would like. We've swung too far in the other direction. Let's try further refinement.
Using GridSearch for Hyperparameter tuning of our tree model
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {'max_depth': [3, 5, 10],
'min_samples_leaf': [2, 4, 10],
'min_samples_split': [2, 5, 10],
'max_leaf_nodes': [10, 15, 20],
'min_impurity_decrease': [0.001, 0.01, 1],
'criterion': ['gini', 'entropy']
}
# Type of scoring used to compare parameter combinations;
# F1 balances precision and recall, which is what we need here
f1_scorer = make_scorer(f1_score, average='binary', pos_label=1)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=f1_scorer, cv=5)
grid_obj = grid_obj.fit(X_train_dt, y_train_dt)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train_dt, y_train_dt)
# Evaluate the model
y_test_pred_2 = estimator.predict(X_test_dt)
y_train_pred_2 = estimator.predict(X_train_dt)
evaluate_model_performance(y_train_dt, y_train_pred_2, y_test_dt, y_test_pred_2)
make_confusion_matrix(y_test_pred_2,y_test_dt)
# Visualize the tree
show_decision_tree(estimator, X_train_dt)
# Visualize feature importance
feature_visualization(estimator, X_dt)
Model Performance Summary:
| Metric | Train | Test |
|---|---|---|
| Precision | 0.8519 | 0.8526 |
| Accuracy | 0.8379 | 0.8379 |
| Recall | 0.9187 | 0.9176 |
| F1 Score | 0.8840 | 0.8839 |
This model performs consistently well across the metrics on both the training and test sets, and it addresses our issues with both overfitting and tree complexity. This is a winner!
Cost Complexity Pruning
This technique prunes (cuts back) a decision tree after it has been fully grown, removing branches that contribute very little to reducing impurity (i.e., that don't really help make better predictions). It trades off model complexity (tree size) against training error (fit to the training data), aiming for a simpler tree with better generalization.
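Concretely, scikit-learn's minimal cost-complexity pruning selects the subtree $T$ minimizing the penalized impurity

```latex
R_\alpha(T) = R(T) + \alpha\,|\widetilde{T}|
```

where $R(T)$ is the total impurity of the leaves, $|\widetilde{T}|$ is the number of leaves, and larger values of the `ccp_alpha` parameter prune more aggressively.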
def plot_alphas(x_axis, y_axis, x_label, y_label, title, y_2=None):
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(x_axis, y_axis, marker='o', drawstyle="steps-post")
    if y_2 is not None:  # avoid the ambiguous truth value of an array
        ax.plot(x_axis, y_2, marker='o', drawstyle="steps-post")
ax.set_xlabel(x_label)
ax.set_ylabel(y_label)
ax.set_title(title)
plt.show()
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train_dt, y_train_dt)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
plot_alphas(ccp_alphas, impurities, "effective alpha", "total impurity of leaves",
"Total Impurity vs effective alpha for training set")
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
clf.fit(X_train_dt, y_train_dt)
clfs.append(clf)
# Drop the last alpha and its tree: it is the trivial single-node tree
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
train_scores = [clf.score(X_train_dt, y_train_dt) for clf in clfs]
test_scores = [clf.score(X_test_dt, y_test_dt) for clf in clfs]
plot_alphas(ccp_alphas, train_scores, "alpha", "accuracy",
"Accuracy vs alpha for training and testing sets", test_scores)
index_best_model = np.argmax(test_scores)
best_model = clfs[index_best_model]
# Evaluate the model
y_test_pred_3 = best_model.predict(X_test_dt)
y_train_pred_3 = best_model.predict(X_train_dt)
evaluate_model_performance(y_train_dt, y_train_pred_3, y_test_dt, y_test_pred_3)
make_confusion_matrix(y_test_pred_3, y_test_dt)
# Visualize the tree
show_decision_tree(best_model, X_train_dt)
# Visualize feature importance
feature_visualization(best_model, X_dt)
Model Performance Summary:
| Metric | Train | Test |
|---|---|---|
| Precision | 0.9309 | 0.8990 |
| Accuracy | 0.9268 | 0.8873 |
| Recall | 0.9626 | 0.9377 |
| F1 Score | 0.9465 | 0.9179 |
While using Cost Complexity Pruning resulted in the best F1 score on our test data, the resulting tree was ridiculously complex and difficult to read.
Model Performance Comparison and Conclusions
def evaluate_multiple_models(train_test_pairs):
"""
Calculates and displays Precision, Accuracy, Recall, and F1 Score
for multiple models, showing both Train and Test scores side-by-side.
Parameters:
- train_test_pairs: List of tuples
Each tuple is (y_train, y_pred_train, y_test, y_pred_test)
Assumes order of models: ["LR", "DT", "DT PP", "DT GSCV", "DT CCP"]
"""
models = ["LR", "DT", "DT PP", "DT GSCV", "DT CCP"]
rows = ['Precision', 'Accuracy', 'Recall', 'F1 Score']
columns = pd.MultiIndex.from_product([models, ['Train', 'Test']], names=["Model", "Set"])
# Empty DataFrame to populate
results_df = pd.DataFrame(index=rows, columns=columns)
for idx, (y_train, y_pred_train, y_test, y_pred_test) in enumerate(train_test_pairs):
model = models[idx]
# Train scores
results_df[(model, 'Train')] = [
precision_score(y_train, y_pred_train, average='binary'),
accuracy_score(y_train, y_pred_train),
recall_score(y_train, y_pred_train, average='binary'),
f1_score(y_train, y_pred_train, average='binary')
]
# Test scores
results_df[(model, 'Test')] = [
precision_score(y_test, y_pred_test, average='binary'),
accuracy_score(y_test, y_pred_test),
recall_score(y_test, y_pred_test, average='binary'),
f1_score(y_test, y_pred_test, average='binary')
]
    print("\nModel Comparison Table (Train & Test Scores):")
display(results_df.style.format("{:.4f}"))
evaluate_multiple_models([
(y_train_lr_2, y_pred_train_3, y_test_lr_2, y_pred_test_3),
(y_train_dt, y_train_pred, y_test_dt, y_test_pred),
(y_train_dt, y_train_pred_1, y_test_dt, y_test_pred_1),
(y_train_dt, y_train_pred_2, y_test_dt, y_test_pred_2),
(y_train_dt, y_train_pred_3, y_test_dt, y_test_pred_3),
])
Model Comparison Table (Train & Test Scores):
| Metric | LR Train | LR Test | DT Train | DT Test | DT PP Train | DT PP Test | DT GSCV Train | DT GSCV Test | DT CCP Train | DT CCP Test |
|---|---|---|---|---|---|---|---|---|---|---|
| Precision | 0.8278 | 0.8328 | 0.9962 | 0.9100 | 0.9962 | 0.7730 | 0.8519 | 0.8526 | 0.9309 | 0.8990 |
| Accuracy | 0.7991 | 0.8065 | 0.9942 | 0.8754 | 0.9942 | 0.7822 | 0.8379 | 0.8379 | 0.9268 | 0.8873 |
| Recall | 0.8854 | 0.8911 | 0.9952 | 0.9041 | 0.9952 | 0.9572 | 0.9187 | 0.9176 | 0.9626 | 0.9377 |
| F1 Score | 0.8556 | 0.8610 | 0.9957 | 0.9070 | 0.9957 | 0.8553 | 0.8840 | 0.8839 | 0.9465 | 0.9179 |
- Logistic Regression (LR) is stable but has a lower performance ceiling
Train and test scores are very close, so there is no overfitting.
Precision and recall are both decent (~83% precision, ~89% recall).
The F1 score (~86%) is balanced but lower than the tree models'.
It generalizes well, but is less powerful at capturing complex patterns.
Conclusion: Logistic Regression is a safe, generalizable model, but slightly weaker in predictive power for this task.
- Unpruned Decision Tree (DT) massively overfits
Train scores are near perfect (99%), while test scores drop sharply.
Huge gap: 99% training vs. 87.5% test accuracy.
F1 drops from 0.9957 (train) to 0.9070 (test).
Classic overfitting: the tree memorizes training-data patterns too tightly.
Conclusion: a decision tree without pruning is overfit and unreliable for real-world hotel cancellation prediction.
- Pre-Pruning (DT PP) curbs overfitting somewhat, but recall dominates
Very high test recall (95.7%): the model catches almost all of the positive class.
Precision drops to 77.3%, meaning many false positives.
Accuracy and F1 score are lower than for the LR and GSCV models.
The model tends to overpredict the positive class.
Conclusion: pre-pruning yields high recall but sacrifices precision and overall balance.
- Grid Search CV (DT GSCV) improves the balance
Precision (85%), recall (91%), and F1 score (~88%).
No major overfitting: train and test results are very close.
Stronger overall performance than basic LR or DT PP.
A suitably simple, readable decision tree with balanced metrics.
Conclusion: GridSearchCV tuning successfully balances the decision tree, yielding a solid model for cancellation prediction.
- Cost-Complexity Pruning (DT CCP) gives the best generalization
Very high precision (89.9%), recall (93.8%), and F1 score (91.8%).
Highest test accuracy (88.7%) among all models.
Very small train/test gap, so it generalizes well.
Best balance across all metrics, but the tree is far too large to read: a poor choice for presenting results to executives.
Conclusion: Cost Complexity Pruning produces the most balanced metrics and the highest F1 score, but the tree's complexity makes it unusable for presentations. We would only select it if no presentation were needed.
Final Recommendation:
- Decision Tree + GridSearchCV (DT GSCV) is the best model overall: it keeps a readable tree while still producing well-balanced results.
- Backup choice if more simplicity is needed: Logistic Regression.
- Backup choice if no presentation is needed: Decision Tree + Cost Complexity Pruning.
Quick Takeaways:
| Model | Good For |
|---|---|
| LR | Simplicity, stability, fast inference |
| DT (no pruning) | Nothing: it overfits badly |
| DT PP | High recall, but low precision |
| DT GSCV | Good balance with a readable tree |
| DT CCP | Best overall generalization, but an unreadable tree |
Actionable Insights and Recommendations
- What profitable cancellation and refund policies can the hotel adopt?
- What other recommendations would you suggest to the hotel?
1. Guest Behavior Insights
- From EDA, October, September, and August are the busiest months.
- The Online segment is the dominant source of guests.
- Repeat guests cancel at a very low rate (only 1.72%).
- Guests making special requests cancel significantly less frequently.
Recommendations:
- Increase room rates and staffing during October, September, and August.
- Focus marketing campaigns and budget primarily on Online channels.
- Expand loyalty programs to encourage repeat guests and offer softer policies for them.
- Encourage special requests during the booking process to lower cancellation rates.
- Consider finding a way to "lock in" low-risk bookings in the system to ensure they don't get overbooked.
2. Market Segment and Pricing Insights
- Online guests pay the highest room rates on average (~€112).
- Aviation and Offline segments pay moderately high rates.
- Complementary guests pay almost nothing (~€3).
- The Online segment also shows the highest cancellation rates; the regression results above indicate Offline and Corporate bookings cancel far less often.
Recommendations:
- Upsell premium rooms and packages to Online and Corporate guests.
- Restrict or monitor complimentary bookings unless tied to loyalty or marketing initiatives.
- Require advance payment or deposits for high-cancellation Online bookings to protect revenue.
3. Cancellation Behavior Insights
- Around 32.76% of all bookings are canceled.
- Bookings with no special requests have the highest cancellation rates (~43%).
Recommendations:
- Implement tiered cancellation policies:
- Non-refundable option: offer a 5β10% discount.
- Full refund policy only if canceled more than 7 days in advance.
- Encourage guests to submit special requests during the booking flow.
- Charge cancellation fees (e.g., one night's stay) for last-minute cancellations (within 3 days of arrival).
- Determine the average rate of cancellation for any given date and consider allowing overbooking of rooms by that amount, so that all rooms can be filled.
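The overbooking idea above can be sized from a date's historical cancellation rate. A naive sketch (the 32.76% figure is the overall rate from the EDA; a real policy would add a safety margin, for example a binomial service-level target, since this ignores variance):

```python
def overbook_limit(rooms, cancel_rate):
    """Bookings to accept so that expected arrivals still fill the hotel.

    Expected arrivals = accepted * (1 - cancel_rate); solve for accepted.
    """
    return round(rooms / (1 - cancel_rate))

# Hypothetical 100-room property at the overall 32.76% cancellation rate
print(overbook_limit(100, 0.3276))  # 149
```

In practice the limit should come from the per-date predicted cancellation probabilities of the model, not a single global rate.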
4. Profitable Policies for Cancellations and Refunds
- Offer non-refundable booking options with small discounts.
- Apply stricter refund rules during peak months (October, September, August).
- Require partial prepayment for high-risk segments.
- Allow flexible cancellation only for loyalty members (3+ stays).
- Require deposits or staged payments for long-lead bookings (91+ days out), since the model shows cancellation risk climbs steadily with lead time.
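The tiered cancellation rules sketched above could be encoded directly in the booking system. A minimal illustration; the 50% mid-tier fee and the exact day thresholds are assumptions for the example, not outputs of the analysis:

```python
def cancellation_fee(days_before_arrival, nightly_rate):
    """Fee charged when a guest cancels, under the hypothetical tiered policy."""
    if days_before_arrival > 7:
        return 0.0                 # full refund for early cancellations
    if days_before_arrival >= 3:
        return 0.5 * nightly_rate  # assumed partial fee in the middle tier
    return nightly_rate            # last-minute: charge one night's stay

print(cancellation_fee(10, 120.0))  # 0.0
print(cancellation_fee(5, 120.0))   # 60.0
print(cancellation_fee(1, 120.0))   # 120.0
```

Loyalty-member exemptions and peak-month tightening would just be additional branches on top of this rule.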
5. Additional Recommendations
- Implement dynamic pricing based on demand, occupancy, and competitor analysis.
- Launch targeted email marketing campaigns based on booking behavior (early bookers, repeat guests, corporate clients).
- Build partnerships with Online Travel Agencies (OTAs) to enhance visibility.
- Track special-request trends and refine services based on guest preferences.
- Add post-booking engagement touchpoints to strengthen guest commitment (e.g., "Would you like a quiet room or a room with a view?").